/var/folders/g7/sfc5tly50013vn_cy1c842180000gn/T/ipykernel_17312/1469011101.py:26: FutureWarning:
A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.
For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.
# Handle Missing Values
import missingno as msno
import matplotlib.pyplot as plt

msno.heatmap(df)
plt.title("Missing Values Heatmap")
plt.show()

# Drop columns that have fewer than 50% non-missing values
df.dropna(thresh=len(df) * 0.5, axis=1, inplace=True)

# Fill missing salaries with the median; assign back rather than
# calling fillna(inplace=True) on a column selection (chained assignment)
if "SALARY_DISPLAY" in df.columns:
    df["SALARY_DISPLAY"] = df["SALARY_DISPLAY"].fillna(df["SALARY_DISPLAY"].median())

# Fill missing text values with a placeholder
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].fillna("Unknown")

# Remove Duplicates
subset_cols = [c for c in ["TITLE", "COMPANY_NAME", "LOCATION", "POSTED"] if c in df.columns]
if subset_cols:
    before = len(df)
    df.drop_duplicates(subset=subset_cols, keep="first", inplace=True)
    print(f"Removed {before - len(df)} duplicates using {subset_cols}")

# Exploratory Data Analysis (EDA)
# Job Postings by Industry (Top 15)
import plotly.express as px

# Name the index before reset_index so the label column is called
# "Industry" regardless of pandas version
counts = (
    df["INDUSTRY_DISPLAY"]
    .value_counts(dropna=False)
    .head(15)
    .rename_axis("Industry")
    .reset_index(name="Count")
    .sort_values("Count")
)
fig1 = px.bar(
    counts,
    x="Count",
    y="Industry",
    orientation="h",
    title="Top 15 Industries by Number of Job Postings",
)
fig1.show()
Removed 3300 duplicates using ['TITLE', 'COMPANY_NAME', 'LOCATION', 'POSTED']
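Before dropping rows, it can help to confirm how many would be removed. The following is a minimal sketch on a toy frame, assuming the same `keep="first"` rule as above; the helper name `count_duplicates` and the sample rows are hypothetical and not part of the original notebook.

```python
import pandas as pd


def count_duplicates(frame, subset):
    """Count the rows that drop_duplicates(subset=subset, keep='first') would remove."""
    return int(frame.duplicated(subset=subset, keep="first").sum())


# Toy data (hypothetical, not taken from the job postings dataset)
toy_posts = pd.DataFrame({
    "TITLE": ["Analyst", "Analyst", "Engineer", "Analyst"],
    "COMPANY_NAME": ["Acme", "Acme", "Acme", "Beta"],
})

key_cols = ["TITLE", "COMPANY_NAME"]
n_dupes = count_duplicates(toy_posts, key_cols)
deduped = toy_posts.drop_duplicates(subset=key_cols, keep="first")

# The preview count matches the number of rows actually dropped
print(n_dupes, len(toy_posts) - len(deduped))
```

Inspecting `frame[frame.duplicated(subset=subset, keep=False)]` in the same way would show every row involved in a duplicate group, which is useful when deciding whether the chosen key columns are too coarse.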
# Salary Distribution by Industry (Top 15)
sdf = df[["INDUSTRY_DISPLAY", "SALARY_DISPLAY"]].copy()
sdf = sdf.dropna()
sdf = sdf[sdf["SALARY_DISPLAY"] > 0]
top_industries = sdf["INDUSTRY_DISPLAY"].value_counts().head(15).index
sdf = sdf[sdf["INDUSTRY_DISPLAY"].isin(top_industries)]
fig2 = px.box(
    sdf,
    x="INDUSTRY_DISPLAY",
    y="SALARY_DISPLAY",
    title="Salary Distribution by Industry (Top 15)",
    points=False,
)
fig2.update_layout(xaxis_tickangle=-45)
fig2.show()

# Remote vs. On-Site Jobs
if "REMOTE_TYPE_NAME" in df.columns:
    rc = df["REMOTE_TYPE_NAME"].value_counts().reset_index()
    rc.columns = ["Remote Type", "Count"]
    fig3 = px.pie(
        rc,
        names="Remote Type",
        values="Count",
        title="Remote vs. On-Site Job Distribution",
    )
    fig3.show()
2 EDA: Rationale & Insights
2.1 Job Postings by Industry
Why: Shows where hiring demand is concentrated and which industries are posting most actively.
Key Insights: The top three industries by number of job postings are Temporary Help Services, Miscellaneous Ambulatory Health Care Services, and Semiconductor and Related Device Manufacturing.
2.2 Salary Distribution by Industry
Why: Compares pay levels across industries and shows where salaries vary enough to leave room for negotiation.
Key Insights: Automotive Parts and Accessories Retailers show a wide salary range, which suggests room for negotiation, while Barber Shops show a narrow range, which suggests little.
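The "wide" and "narrow" ranges read off the box plot can be quantified with the interquartile range (IQR) per industry. The sketch below runs on toy data; the helper name `salary_spread` and the sample salaries are hypothetical, and only the column names `INDUSTRY_DISPLAY` and `SALARY_DISPLAY` come from the notebook.

```python
import pandas as pd


def salary_spread(frame, group_col="INDUSTRY_DISPLAY", value_col="SALARY_DISPLAY"):
    """Return per-group median, quartiles, and IQR, sorted with the widest spread first."""
    grouped = frame.groupby(group_col)[value_col]
    summary = pd.DataFrame({
        "median": grouped.median(),
        "q1": grouped.quantile(0.25),
        "q3": grouped.quantile(0.75),
    })
    summary["iqr"] = summary["q3"] - summary["q1"]
    return summary.sort_values("iqr", ascending=False)


# Toy data (hypothetical values, not from the job postings dataset)
toy = pd.DataFrame({
    "INDUSTRY_DISPLAY": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "SALARY_DISPLAY": [40000, 60000, 80000, 100000, 50000, 51000, 52000, 53000],
})

spread = salary_spread(toy)
print(spread)
```

Applied to the filtered `sdf` frame, the first and last rows of the result would correspond to the industries with the widest and narrowest boxes. Because missing salaries were filled with the median earlier in the pipeline, the spread for industries with many imputed values may be understated.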
2.3 Remote vs. On-Site Jobs
Why: Workplace flexibility is a major factor in today’s job market.
Key Insights: Most postings (78.3%) do not specify remote status. About 17% are remote, 3.1% are hybrid, and 1.6% are explicitly not remote.
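Percentage shares like those quoted above can be computed directly rather than read off the pie chart. This is a minimal sketch on a toy series; the helper name `category_shares` and the sample labels are hypothetical, and it assumes missing values should count as their own group.

```python
import pandas as pd


def category_shares(series):
    """Percentage share of each category, counting missing values as their own group."""
    shares = series.value_counts(normalize=True, dropna=False) * 100
    return shares.round(1)


# Toy data (hypothetical labels, not from the job postings dataset)
toy_remote = pd.Series(
    ["Remote", "Remote", None, None, None, None, "Hybrid", "Not Remote", None, None]
)

shares = category_shares(toy_remote)
print(shares)
```

On the cleaned notebook data, `category_shares(df["REMOTE_TYPE_NAME"])` would be the equivalent call; since text columns were filled with "Unknown" during cleaning, unspecified postings may appear under that label rather than as missing values.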